Wine Quality Exploratory Data Analysis

This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). This report covers univariate, bivariate, and multivariate analysis of which chemical properties influence the quality of red wines.

Univariate Plots Section

structure of the data set

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

There are 1599 observations of 13 numeric variables. X appears to be the index.
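A minimal sketch of how the index column could be checked and dropped before further analysis. The data frame name wine is an assumption, and only the first six rows of a few columns (copied from the data set preview in this report) are rebuilt here so the snippet runs on its own.

```r
# First six rows copied from the data set preview in this report
# (only a few columns, to keep the sketch short).
wine <- data.frame(
  X             = 1:6,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# X is an index if it simply counts the rows from 1 to n.
x_is_index <- identical(wine$X, seq_len(nrow(wine)))

# Drop it so it does not show up in summaries or correlations.
if (x_is_index) {
  wine$X <- NULL
}

str(wine)
```

With the full data set, the same check would run on the data frame as loaded, leaving 12 variables.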

statistics of the variables

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

quality is an ordered, categorical, discrete variable. Its values range only from 3 to 8, with a mean of 5.6 and a median of 6. All other variables appear to be continuous quantities. fixed.acidity and volatile.acidity, and likewise free.sulfur.dioxide and total.sulfur.dioxide, may possibly be dependent on or subsets of each other.
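One way to probe the suspected subset relationship is to check whether free sulfur dioxide ever exceeds total sulfur dioxide. The sketch below uses the first six rows from this report; the data frame name wine and the derived column bound.sulfur.dioxide are assumptions introduced only for illustration.

```r
# First six rows of the two sulfur dioxide columns, copied from this report.
wine <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40)
)

# If free SO2 is a part of total SO2, it can never be larger than the total.
free_within_total <- all(wine$free.sulfur.dioxide <= wine$total.sulfur.dioxide)

# Hypothetical derived variable: the part of total SO2 that is not free.
wine$bound.sulfur.dioxide <- wine$total.sulfur.dioxide - wine$free.sulfur.dioxide

free_within_total
wine
```

A result of TRUE on the full data set would be consistent with free sulfur dioxide being a component of the total, though it would not prove it.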

main features of the data set

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

data set preview

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

distribution of wine quality

Quality ranges from 3 to 8, with a mean between 5 and 6.

adding new variables

To look at how the other variables relate to quality grades, quality is translated into a categorical variable. The quality variable is converted to a factor and added to the data frame as a new variable, quality.factor. In addition, 3 categories of quality were created: good (>= 7), bad (<= 4), and medium (5 and 6).
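The two new variables described above could be built roughly as follows. This is a sketch under assumptions: the data frame name wine and the column name quality.category are hypothetical, and the quality values are an illustrative vector covering the observed range 3 to 8 rather than the real data.

```r
# Illustrative quality scores covering the observed range (3 to 8).
wine <- data.frame(quality = c(3L, 4L, 5L, 6L, 7L, 8L))

# Ordered factor version of the numeric score.
wine$quality.factor <- factor(wine$quality, levels = 3:8, ordered = TRUE)

# Three grades: bad (<= 4), medium (5 and 6), good (>= 7).
# cut() uses right-closed intervals by default: (-Inf, 4], (4, 6], (6, Inf).
wine$quality.category <- cut(
  wine$quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("bad", "medium", "good"),
  ordered_result = TRUE
)

table(wine$quality.factor, wine$quality.category)
```

Keeping both variables ordered means boxplots and legends list the grades from lowest to highest instead of alphabetically.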

fixed acidity

The distribution of fixed acidity is right skewed. The median is around 8, where most wines are concentrated, while some high outliers pull the mean up to 8.32.

volatile acidity

The distribution of volatile acidity has two peaks around 0.4 and 0.6.

When plotted on a base 10 logarithmic scale, volatile.acidity appears to be approximately normally distributed.
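The log transformation can be sketched in base R as below. The values are the first ten volatile.acidity observations shown in the structure output of this report, so the binning is only illustrative; the variable names are assumptions.

```r
# First ten volatile.acidity values from the structure output above.
volatile.acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

# Base 10 logarithm of each value.
log_va <- log10(volatile.acidity)

# Bin the transformed values without drawing; drop plot = FALSE to see the histogram.
h <- hist(log_va, breaks = 5, plot = FALSE)

data.frame(bin_mid = h$mids, count = h$counts)
```

On the full data set, comparing this histogram with the one on the original scale shows whether the transformation makes the shape more symmetric.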

citric acid

Apart from some outliers, the distribution of citric acid looks almost rectangular.

residual sugar

Residual sugar is the amount of sugar left after fermentation. Its distribution is positively skewed, with a high peak at around 2.3 and many outliers at the higher ranges. The median is about 2.2, and the 1st and 3rd quartiles are 1.9 and 2.6.

pH

pH has an approximately normal distribution, with the middle half of the values between 3.2 and 3.4.

sulphates

Sulphates center around 0.6, with a right skew extending all the way to 2.0.

alcohol

Alcohol also follows a skewed distribution, but the skewness is less pronounced than that of chlorides or residual sugar.

Distributions of the features investigated:

  • Normal: Quality, pH

  • Positively Skewed: Fixed acidity, Citric acid, Sulphates, Alcohol

  • Long Tail: Residual sugar

  • Bimodal: Volatile acidity
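As a rough numeric cross-check of the shapes listed above, the gap between mean and median can be compared with the interquartile range. The sketch below uses only the quartiles, medians, and means printed in the summary output of this report; the data frame stats and the column shift are hypothetical names, and this ratio is an informal indicator, not a formal skewness statistic.

```r
# Summary statistics copied from the summary output in this report.
stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "pH", "sulphates", "alcohol"),
  q1     = c(7.10, 0.39,   0.09,  1.9,   3.21,  0.55,   9.5),
  median = c(7.90, 0.52,   0.26,  2.2,   3.31,  0.62,   10.2),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 3.311, 0.6581, 10.42),
  q3     = c(9.20, 0.64,   0.42,  2.6,   3.40,  0.73,   11.1)
)

# Mean minus median, scaled by the interquartile range.
# Values near 0 suggest symmetry; larger positive values suggest right skew.
stats$shift <- (stats$mean - stats$median) / (stats$q3 - stats$q1)

stats[order(-stats$shift), c("variable", "shift")]
```

This indicator cannot detect bimodality, so the two peaks of volatile acidity still have to be read from the histogram.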

Bivariate Analysis

relationship between wine characteristics and quality

Increasing quality of wine when chemical levels are higher

The boxplots above show all the cases where wine quality increases with increasing values of another variable.
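The trend a boxplot shows can also be summarised as a median per quality grade. This is a minimal sketch using the first ten alcohol and quality values from the structure output of this report; the data frame name wine is an assumption, and ten rows are far too few to draw conclusions from.

```r
# First ten rows of alcohol and quality from the structure output above.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Median alcohol for each quality grade (the centre line of each box).
median_by_quality <- tapply(wine$alcohol, wine$quality, median)

median_by_quality

# The corresponding plot would be: boxplot(alcohol ~ quality, data = wine)
```

Swapping alcohol for another column gives the same summary for any of the other chemical properties.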

decreasing value with higher wine quality

These are the cases where wine quality decreases as the values of the variables increase.

correlations between fixed acidity with density and citric acid

The highest positive correlations are between density and fixed.acidity, and between fixed.acidity and citric.acid.
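Pairwise correlations like these can be computed with cor(). The sketch below runs on the first six rows from the data set preview, so the coefficients it prints are not the ones behind this report; the data frame name wine is an assumption.

```r
# First six rows of three columns, copied from the data set preview.
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0),
  density       = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978)
)

# Pearson correlation matrix for all pairs of columns.
cor_matrix <- cor(wine, method = "pearson")

round(cor_matrix, 2)

# Correlations of fixed.acidity with the other two variables.
cor_matrix["fixed.acidity", c("density", "citric.acid")]
```

For a single pair on the full data set, cor.test() would additionally report a confidence interval and p-value.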

Multivariate Analysis and Plots

quality classification on sulphates and alcohol

Previous boxplots show that levels of both sulphates and alcohol increase with higher quality of red wine. The scatter plot above uses a combination of sulphates and alcohol to classify and distinguish wine quality levels. The plot reveals a clear pattern, with most of the green and yellow dots (high quality) in the area where both alcohol and sulphates levels are high. There is also a visible band of violet dots in the middle of the plot, and orange dots (low quality) in the bottom-left corner. This implies that such a combination of variables distinguishes between different levels of wine quality.

quality classification on fixed acidity and density

Although the plot is not very clear, it reveals some patterns. The majority of red, orange, and yellow dots (low quality) are concentrated in the lower left, while the majority of blue dots (high quality) are concentrated in the bottom half of the plot. Lower quality wine seems to have higher density and lower fixed acidity, whereas higher quality wine has lower fixed acidity and lower density.

quality classification on pH, total sulfur dioxide, and free sulfur dioxide

These plots show the influence of pH and sulfur dioxide on the quality of red wine. The left plot shows more high quality wines at lower pH, and the opposite for low quality wines. The right plot shows that high quality wines have higher free sulfur dioxide and total sulfur dioxide, and the opposite for low quality wines.

Final Plots and Summary

univariate analysis

Quality, pH, and density are normally distributed. Fixed acidity, citric acid, total sulfur dioxide, free sulfur dioxide, sulphates, and alcohol are positively skewed. Residual sugar and chlorides have long-tailed distributions. Volatile acidity has a bimodal distribution.

pH with density and fixed acidity

The correlations between pH and density, and between pH and fixed acidity, are negative. Density and fixed acidity are lower when pH is higher.

quality classification by fixed acidity and density

Although the plot is not very clear, it reveals some patterns. The majority of red, orange, and yellow dots (low quality) are concentrated in the lower left, while the majority of blue dots (high quality) are concentrated in the bottom half of the plot. Lower quality wine seems to have higher density and lower fixed acidity, whereas higher quality wine has lower fixed acidity and lower density.

Summary

  • Quality and pH have normal distributions.
  • Fixed acidity, citric acid, sulphates, and alcohol are positively skewed.
  • Alcohol, sulphates, citric acid, and fixed acidity are higher when wine quality is higher.
  • Conversely, volatile acidity, pH, and density are lower when wine quality is higher.
  • The highest positive correlation is between density and fixed acidity.
  • Key factors that determine and drive wine quality are alcohol, sulphates, and acidity.

Reflections

In this exercise, my main struggle was to make various plots with as few errors as possible. This required lots of searching on Google, reviewing the lectures, and other research on related documents. Once I started to plot the data, the challenge was to gain a good understanding of each variable and to decide which 2 or more variables should be included in the next analysis.

I feel more confident using RStudio after this exercise and I’m looking forward to doing more analysis in R in the future. In future analysis, I’d like to explore correlation and prediction further after the EDA phase.